Suffix Tree for a Sliding Window: An Overview

Author

  • M. Senft
Abstract

The suffix tree is a powerful data structure originally developed for string matching and string searching. It has found many applications over time, some of them in the field of data compression. Many of these applications need a suffix tree built for a sliding window, and two clever algorithms, by Fiala and Greene and by Larsson, make this possible. However, as we show, both approaches have flawed proofs. We remedy this situation by explaining a simple alternative algorithm and by giving a correct proof.

Introduction

In 1973 Weiner introduced a powerful new data structure for string matching and searching [Weiner, 1973]. This data structure is called the suffix tree and has found many applications over time. Despite recently losing some ground to the CDAWG [Crochemore and Vérin, 1997] and the suffix array [Manber and Myers, 1993], the suffix tree is still a very interesting data structure. Particularly interesting is the application of the suffix tree to data compression [Fiala and Greene, 1989; Larsson, 1999; Senft, 2005]. Many of these applications require the suffix tree to be maintained over a so-called sliding window [Ziv and Lempel, 1977]. Fiala and Greene developed a clever method for adapting the suffix tree to a sliding window [Fiala and Greene, 1989], which was later modified by Larsson [Larsson, 1999]. These methods are very similar and share a common weakness: their correctness proofs are flawed, as we will show later. We remedy this situation by giving a correct proof and by describing a simpler working method.

This paper is organised as follows. The next section reviews the necessary notation and terminology, leading to the definitions of two main concepts: the suffix tree and the sliding window. The third section describes a suffix tree adaptation for a sliding window and also contains our original results.
First, the suffix tree construction and symbol deletion algorithms are reviewed; then a suffix tree implementation is described and edge label maintenance is addressed. Two well-known algorithms for edge label maintenance [Fiala and Greene, 1989; Larsson, 1999] are described and analysed, and our own simple replacement is given. Weaknesses in the correctness proofs of Fiala and Greene's algorithm and of Larsson's algorithm are shown, and a new, sound proof is given. We conclude the paper with final remarks in the last section.

Concepts and Notation

We omit basic string and graph-related definitions, such as those of an alphabet, symbol and prefix, or of a root, edge and parent. We also give only informal definitions for most non-basic concepts used in this paper and refer the reader to, e.g., [Senft, 2005] for details.

Strings

The following string-related definitions simplify the suffix tree definition as well as the description of the algorithms in later sections. A string α is said to occur in a string δ if there exists a position i such that the sequence of symbols beginning at position i and ending at position i + |α| − 1 equals α. This sequence of characters is called an occurrence of α in δ at position i. A string is unique in δ if it occurs in δ exactly once. A right branching substring of δ occurs at least twice in δ, and at least two of these occurrences are followed by different characters. A proper substring of δ is either the empty string, a right branching substring of δ, or a unique suffix of δ.

Suffix Tree

Standard graph terminology (cf. [Harary, 1969]) is used to define the suffix tree. However, to simplify matters, vertices of a tree that are not leaves are called nodes. The suffix tree for a string δ is a rooted tree with edges labelled by nonempty substrings of δ (see Fig. 1). The strings represented in the suffix tree are exactly all substrings of δ.
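
The string definitions given in the Strings section (occurrence, unique string, right branching substring) can be checked by brute force on small inputs. The following is a minimal Python sketch, not taken from the paper: the function names `occurrences`, `is_unique` and `is_right_branching` are hypothetical, positions are assumed to be 0-based (the excerpt does not fix an indexing convention), and the empty string is not given any special treatment.

```python
from typing import List


def occurrences(alpha: str, delta: str) -> List[int]:
    """Return every 0-based position i at which alpha occurs in delta.

    An occurrence at i means delta[i .. i + len(alpha) - 1] equals alpha.
    """
    n, m = len(delta), len(alpha)
    return [i for i in range(n - m + 1) if delta[i:i + m] == alpha]


def is_unique(alpha: str, delta: str) -> bool:
    """A string is unique in delta if it occurs in delta exactly once."""
    return len(occurrences(alpha, delta)) == 1


def is_right_branching(alpha: str, delta: str) -> bool:
    """True if two occurrences of alpha in delta are followed by different characters.

    Two distinct following characters imply at least two occurrences,
    so the 'occurs at least twice' condition needs no separate check.
    """
    followers = set()
    for i in occurrences(alpha, delta):
        nxt = i + len(alpha)  # index of the character right after this occurrence
        if nxt < len(delta):  # an occurrence at the very end has no follower
            followers.add(delta[nxt])
    return len(followers) >= 2


if __name__ == "__main__":
    delta = "abcabd"
    print(occurrences("ab", delta))         # [0, 3]
    print(is_unique("abc", delta))          # True
    print(is_right_branching("ab", delta))  # True: followed by 'c' and by 'd'
    print(is_right_branching("a", delta))   # False: always followed by 'b'
```

The quadratic scan is chosen only for readability; avoiding this kind of repeated scanning is exactly what a suffix tree is used for.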
The root represents …

WDS'05 Proceedings of Contributed Papers, Part I, 41–46, 2005. ISBN 80-86732-59-2 © MATFYZPRESS

Similar resources

Sliding Suffix Tree

We consider a sliding window W over a stream of characters from some alphabet of constant size. The user wants to perform deterministic substring matching on the current sliding window content and obtain positions of the matches. We present an indexed version of the sliding window based on a suffix tree. The data structure of size Θ(|W|) has optimal time queries Θ(m + occ) and amortized consta...

Compact Directed Acyclic Word Graphs for a Sliding Window

The suffix tree is a well-known and widely-studied data structure that is highly useful for string matching. The suffix tree of a string w can be constructed in O(n) time and space, where n denotes the length of w. Larsson achieved an efficient algorithm to maintain a suffix tree for a sliding window. It contributes to prediction by partial matching (PPM) style statistical data compression sche...

Most Recent Match Queries in On-Line Suffix Trees

A suffix tree is able to efficiently locate a pattern in an indexed string, but not in general the most recent copy of the pattern in an online stream, which is desirable in some applications. We study the most general version of the problem of locating a most recent match: supporting queries for arbitrary patterns, at each step of processing an online stream. We present augmentations to Ukkone...

Lempel-Ziv Compression in a Sliding Window

We present new algorithms for the sliding window Lempel-Ziv (LZ77) problem and the approximate rightmost LZ77 parsing problem. Our main result is a new and surprisingly simple algorithm that computes the sliding window LZ77 parse in O(w) space and either O(n) expected time or O(n log log w + z log log σ) deterministic time. Here, w is the window size, n is the size of the input string, z is the ...

Structures of String Matching and Data Compression

This doctoral dissertation presents a range of results concerning efficient algorithms and data structures for string processing, including several schemes contributing to sequential data compression. It comprises both theoretic results and practical implementations. We study the suffix tree data structure, presenting an efficient representation and several generalizations. This includes augmen...

Publication date: 2005